# A tibble: 4 × 2
region n
<chr> <int>
1 Midwest 1101
2 Northeast 3224
3 South 1111
4 West 5095
MKTG 585R Quantitative Marketing Pre-PhD Seminar
The go-to statistic for a discrete variable is a count.
The obvious statistic for a continuous variable is a mean.
Note that summarize() is a more general version of count().
We can also compute the mode, median, variance, standard deviation, minimum, maximum, sum, etc.
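As a minimal sketch of these summaries, assuming a small made-up data frame (`toy_customers` and its values are hypothetical, standing in for `customer_data`):

```r
library(dplyr)

# Hypothetical toy data standing in for customer_data.
toy_customers <- tibble(
  region = c("West", "West", "South", "Midwest", "South", "West"),
  income = c(50000, 70000, 40000, 60000, 80000, 90000)
)

# A count for a discrete variable.
region_counts <- toy_customers |>
  count(region)

# A mean (and other statistics) for a continuous variable.
income_summary <- toy_customers |>
  summarize(
    avg_income = mean(income),
    median_income = median(income),
    sd_income = sd(income),
    min_income = min(income),
    max_income = max(income)
  )

region_counts
income_summary
```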
{ggplot2} provides a consistent grammar of graphics built with layers.
Plot our first summary (note how + is different from |>).
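A minimal sketch of such a plot, assuming a made-up `toy_customers` data frame (hypothetical) in place of the course data. The pipe hands data from one function to the next, while `+` adds a layer to a plot that already exists.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for customer_data.
toy_customers <- tibble(
  region = c("West", "West", "South", "Midwest", "South", "West")
)

region_plot <- toy_customers |>
  count(region) |>                    # |> passes the data along
  ggplot(aes(x = region, y = n)) +    # + adds plot layers from here on
  geom_col()

region_plot
```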
Facets allow us to visualize by another discrete variable. For example, is this relationship different depending on gender?
It’s no longer a count on the y-axis. Let’s change the labels.
customer_data |>
  count(region, college_degree, gender) |>
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  facet_wrap(~ gender) +
  labs(
    title = "Proportion of Customers with College Degrees by Region and Gender",
    subtitle = "Based on 10,531 Customers in the CRM Database",
    x = "Region",
    y = "Proportion"
  )
What about the legend? And these colors?
customer_data |>
  count(region, college_degree, gender) |>
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  facet_wrap(~ gender) +
  labs(
    title = "Proportion of Customers with College Degrees by Region and Gender",
    subtitle = "Based on 10,531 Customers in the CRM Database",
    x = "Region",
    y = "Proportion"
  ) +
  scale_fill_manual(
    name = "College Degree",
    values = c("royalblue", "darkblue")
  )
Let’s plot the distribution of income.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
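The message above suggests choosing a bin width explicitly. A minimal sketch, assuming hypothetical income values and an arbitrary `binwidth` of 25,000 (neither comes from the course data):

```r
library(ggplot2)
library(tibble)

# Hypothetical incomes standing in for customer_data$income.
toy_customers <- tibble(
  income = c(18000, 45000, 52000, 83000, 139000, 187000, 470000)
)

income_hist <- ggplot(toy_customers, aes(x = income)) +
  geom_histogram(binwidth = 25000)  # each bar spans 25,000 units of income

income_hist
```

Setting `binwidth` yourself also silences the `stat_bin()` message.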
Visualize the relationship between income and credit.
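A scatterplot is the usual starting point for two continuous variables. A minimal sketch, assuming hypothetical `income` and `credit` values:

```r
library(ggplot2)
library(tibble)

# Hypothetical values standing in for customer_data.
toy_customers <- tibble(
  income = c(31000, 58000, 64000, 73000, 164000, 233000),
  credit = c(749, 644, 574, 742, 554, 702)
)

income_credit_plot <- ggplot(toy_customers, aes(x = income, y = credit)) +
  geom_point()

income_credit_plot
```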
Visualize the relationship between star_rating and income.
Warning: Removed 7372 rows containing missing values (`geom_point()`).
What do we do if there is overplotting? There’s a geom for that.
`geom_smooth()` using formula = 'y ~ x'
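One possible approach to overplotting, sketched here on hypothetical data, is to jitter the points and overlay a smoothed trend. The choice of `geom_jitter()` and a linear `geom_smooth()` is an assumption for illustration, not necessarily the plot used in the slides.

```r
library(ggplot2)
library(tibble)

# Hypothetical ratings and incomes standing in for customer_data.
toy_customers <- tibble(
  star_rating = c(1, 2, 2, 3, 3, 4, 4, 5, 5, 5),
  income = c(40000, 55000, 60000, 62000, 70000,
             75000, 90000, 85000, 95000, 120000)
)

rating_plot <- ggplot(toy_customers, aes(x = star_rating, y = income)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +  # spread out overlapping points
  geom_smooth(method = "lm", formula = y ~ x)          # add a fitted trend line

rating_plot
```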
Grouped summaries provide a powerful solution for computing continuous statistics by discrete categories.
Note how the group_by() function is a lot like facet_wrap(): it splits the data by each category in the discrete grouping variable, so whatever follows is computed within each group.
count() is a wrapper around a grouped summary using n().
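A minimal sketch of that equivalence, assuming a hypothetical `toy_customers` data frame:

```r
library(dplyr)

# Hypothetical toy data standing in for customer_data.
toy_customers <- tibble(
  region = c("West", "West", "South", "Midwest", "South", "West")
)

# The shortcut.
with_count <- toy_customers |>
  count(region)

# The grouped summary it wraps.
with_summary <- toy_customers |>
  group_by(region) |>
  summarize(n = n())

with_count
with_summary
```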
We can group by more than one discrete variable.
customer_data |>
  group_by(gender, region) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    avg_credit = mean(credit)
  ) |>
  arrange(desc(avg_income))
`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
# A tibble: 12 × 5
# Groups: gender [3]
gender region n avg_income avg_credit
<chr> <chr> <int> <dbl> <dbl>
1 Other Midwest 124 154637. 663.
2 Male Midwest 420 152467. 666.
3 Other Northeast 337 150564. 665.
4 Male Northeast 1285 150498. 665.
5 Male West 2079 149453. 667.
6 Other West 519 144420. 667.
7 Female Midwest 557 134083. 671.
8 Female West 2497 133819. 668.
9 Female Northeast 1602 133333. 669.
10 Other South 118 119068. 660.
11 Male South 430 117988. 669.
12 Female South 563 105888. 664.
We can also use slice_*() functions along with group_by().
# A tibble: 36 × 13
# Groups: gender, region [12]
customer_id birth_year gender income credit married college_degree region
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 6119 1984 Female 315000 698. No Yes Midwest
2 1299 1992 Female 306000 610. No Yes Midwest
3 7139 1957 Female 302000 727. No Yes Midwest
4 1040 1993 Female 356000 672. Yes Yes Northeast
5 6503 1997 Female 348000 599. No Yes Northeast
6 11249 1992 Female 343000 620. No Yes Northeast
7 2075 1989 Female 374000 578. No Yes South
8 7128 1966 Female 301000 708. Yes Yes South
9 4756 1977 Female 293000 790. No Yes South
10 7366 1993 Female 376000 682. No Yes West
# ℹ 26 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>
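A minimal sketch of a grouped slice, assuming hypothetical data and using slice_max() as one member of the slice_*() family (the exact call behind the output above is not shown in the source):

```r
library(dplyr)

# Hypothetical toy data standing in for customer_data.
toy_customers <- tibble(
  gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  income = c(50000, 90000, 70000, 40000, 80000, 60000)
)

# Keep the two highest incomes within each gender.
top_earners <- toy_customers |>
  group_by(gender) |>
  slice_max(income, n = 2)

top_earners
```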
Just like there are geoms for visualizing continuous or discrete data, there are geoms for visualizing the relationship between continuous and discrete data.
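A boxplot is one such geom. A minimal sketch on hypothetical data:

```r
library(ggplot2)
library(tibble)

# Hypothetical toy data standing in for customer_data.
toy_customers <- tibble(
  region = c("West", "West", "West", "South", "South", "South"),
  income = c(50000, 70000, 90000, 40000, 60000, 80000)
)

# Discrete variable on the x-axis, continuous variable on the y-axis.
income_by_region <- ggplot(toy_customers, aes(x = region, y = income)) +
  geom_boxplot()

income_by_region
```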
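Tidy data is defined as follows: each variable has its own column, each observation has its own row, and each value has its own cell.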
This may seem obvious or simple, but this common philosophy is at the heart of the tidyverse. It also means we will often prefer longer datasets to wider datasets and {tidyr} will help us move between the two.
The most common problem with messy data is when column names are really values.
# A tibble: 10,531 × 14
region customer_id jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 South 1001 1 0 0 0 0 4
2 West 1002 0 0 0 0 0 0
3 South 1003 4 0 0 0 0 0
4 Midwest 1004 0 0 0 2 0 0
5 West 1005 0 1 0 0 0 0
6 Midwest 1006 0 0 0 0 0 0
7 Midwest 1007 0 0 0 4 0 0
8 South 1008 0 0 0 0 0 0
9 West 1009 0 0 0 0 1 0
10 Northeast 1010 0 0 0 0 0 0
# ℹ 10,521 more rows
# ℹ 6 more variables: jul_2018 <dbl>, aug_2018 <dbl>, sep_2018 <dbl>,
# oct_2018 <dbl>, nov_2018 <dbl>, dec_2018 <dbl>
When column names are really values, the data frame ends up being wider than it should be. Use pivot_longer() to pivot the data frame longer by turning column names into values.
Note how much longer the data frame is and why.
# A tibble: 126,372 × 4
region customer_id month_year transactions
<chr> <dbl> <chr> <dbl>
1 South 1001 jan_2018 1
2 South 1001 feb_2018 0
3 South 1001 mar_2018 0
4 South 1001 apr_2018 0
5 South 1001 may_2018 0
6 South 1001 jun_2018 4
7 South 1001 jul_2018 0
8 South 1001 aug_2018 0
9 South 1001 sep_2018 0
10 South 1001 oct_2018 0
# ℹ 126,362 more rows
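A minimal sketch of this pivot, assuming a small hypothetical wide data frame (`toy_wide`) with the same shape as the CRM data:

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: month columns are really values.
toy_wide <- tibble(
  region = c("South", "West"),
  customer_id = c(1001, 1002),
  jan_2018 = c(1, 0),
  feb_2018 = c(0, 0),
  mar_2018 = c(0, 2)
)

toy_long <- toy_wide |>
  pivot_longer(
    -c(region, customer_id),       # pivot every column except these two
    names_to = "month_year",       # old column names go here
    values_to = "transactions"     # old cell values go here
  )

toy_long
```

Two customers times three month columns gives six rows, which is why the long version has so many more rows than the wide one.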
Now summarizing the transactions for 2018 by region is trivial.
If the data has the opposite problem and has values that should really be column names, use pivot_wider() to pivot the data frame wider by turning values into column names.
# A tibble: 10,531 × 14
region customer_id jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 South 1001 1 0 0 0 0 4
2 West 1002 0 0 0 0 0 0
3 South 1003 4 0 0 0 0 0
4 Midwest 1004 0 0 0 2 0 0
5 West 1005 0 1 0 0 0 0
6 Midwest 1006 0 0 0 0 0 0
7 Midwest 1007 0 0 0 4 0 0
8 South 1008 0 0 0 0 0 0
9 West 1009 0 0 0 0 1 0
10 Northeast 1010 0 0 0 0 0 0
# ℹ 10,521 more rows
# ℹ 6 more variables: jul_2018 <dbl>, aug_2018 <dbl>, sep_2018 <dbl>,
# oct_2018 <dbl>, nov_2018 <dbl>, dec_2018 <dbl>
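A minimal sketch of the reverse pivot, assuming a hypothetical long data frame (`toy_long`):

```r
library(tidyr)
library(tibble)

# Hypothetical long data: one row per customer per month.
toy_long <- tibble(
  customer_id = c(1001, 1001, 1002, 1002),
  month_year = c("jan_2018", "feb_2018", "jan_2018", "feb_2018"),
  transactions = c(1, 0, 0, 3)
)

toy_wide <- toy_long |>
  pivot_wider(
    names_from = month_year,     # values that become column names
    values_from = transactions   # values that fill the new columns
  )

toy_wide
```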
If two (or more) values are in one column, separate() the values into two (or more) columns.
# A tibble: 126,372 × 5
region customer_id month year transactions
<chr> <dbl> <chr> <chr> <dbl>
1 South 1001 jan 2018 1
2 South 1001 feb 2018 0
3 South 1001 mar 2018 0
4 South 1001 apr 2018 0
5 South 1001 may 2018 0
6 South 1001 jun 2018 4
7 South 1001 jul 2018 0
8 South 1001 aug 2018 0
9 South 1001 sep 2018 0
10 South 1001 oct 2018 0
# ℹ 126,362 more rows
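A minimal sketch of separate(), assuming a hypothetical `toy_long` data frame. Note that the new columns are character by default, which matches the `<chr>` type of `year` in the output above.

```r
library(tidyr)
library(tibble)

# Hypothetical long data with two values packed into month_year.
toy_long <- tibble(
  customer_id = c(1001, 1001),
  month_year = c("jan_2018", "feb_2018"),
  transactions = c(1, 0)
)

toy_separated <- toy_long |>
  separate(month_year, into = c("month", "year"), sep = "_")

toy_separated
```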
Now we can summarize the transactions for 2018 by month and region.
crm_long |>
  group_by(month, region) |>
  summarize(
    total_transactions = sum(transactions),
    avg_transactions = mean(transactions)
  ) |>
  arrange(desc(avg_transactions))
`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.
# A tibble: 48 × 4
# Groups: month [12]
month region total_transactions avg_transactions
<chr> <chr> <dbl> <dbl>
1 nov South 1812 1.63
2 dec Midwest 1723 1.56
3 dec West 7922 1.55
4 dec Northeast 4928 1.53
5 dec South 1693 1.52
6 nov West 7656 1.50
7 nov Northeast 4824 1.50
8 nov Midwest 1646 1.50
9 apr Midwest 751 0.682
10 may South 749 0.674
# ℹ 38 more rows
When two (or more) values should be in one column, unite() the values into one column.
# A tibble: 126,372 × 4
region customer_id month_year transactions
<chr> <dbl> <chr> <dbl>
1 South 1001 jan_2018 1
2 South 1001 feb_2018 0
3 South 1001 mar_2018 0
4 South 1001 apr_2018 0
5 South 1001 may_2018 0
6 South 1001 jun_2018 4
7 South 1001 jul_2018 0
8 South 1001 aug_2018 0
9 South 1001 sep_2018 0
10 South 1001 oct_2018 0
# ℹ 126,362 more rows
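A minimal sketch of unite(), assuming a hypothetical data frame with separate `month` and `year` columns:

```r
library(tidyr)
library(tibble)

# Hypothetical data with month and year in separate columns.
toy_separated <- tibble(
  customer_id = c(1001, 1001),
  month = c("jan", "feb"),
  year = c("2018", "2018"),
  transactions = c(1, 0)
)

toy_united <- toy_separated |>
  unite("month_year", month, year, sep = "_")

toy_united
```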
We’ve been using data frames (technically tibbles, the tidyverse data frame). A data frame is composed of columns called vectors. Both data frames and vectors are classes of data.
Each vector has a single data type. Data types include double (i.e., numeric), integer, date, character, and factor. Discrete data is often a factor, which stores integer codes along with character labels (its levels).
If we try to mix data types in a vector, R will coerce every element to the most flexible type present (logical, then integer, then double, then character).
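A short illustration of what happens when types are mixed in c(); the vector names are made up for this example:

```r
# integer + double -> double
mixed_numeric <- c(1L, 2.5)
typeof(mixed_numeric)

# logical + double -> double (TRUE becomes 1)
mixed_logical <- c(TRUE, 3)
typeof(mixed_logical)

# anything + character -> character
mixed_character <- c(1, "two", TRUE)
typeof(mixed_character)
mixed_character
```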
Sometimes we need to coerce a data class or type, for example as_tibble(). We can similarly coerce data types with as.*() functions (e.g., as.numeric() and as.character()).
However, coercing dates is very tricky (see supplementary material for details) while we often want to coerce factors using the fct_*() functions.
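A minimal sketch of these coercions, assuming made-up values. fct_relevel() is used here as one example from the fct_*() family in {forcats}.

```r
library(tibble)
library(forcats)

# Coerce data types with as.*() functions.
number_from_text <- as.numeric("42")
text_from_number <- as.character(3.14)

# Coerce a data class: base data frame to tibble.
toy_tibble <- as_tibble(data.frame(x = 1:3))

# Factor levels are alphabetical by default.
ratings <- factor(c("low", "high", "medium", "low"))
levels(ratings)

# Put the levels in a meaningful order.
ratings_ordered <- fct_relevel(ratings, "low", "medium", "high")
levels(ratings_ordered)
as.integer(ratings_ordered)  # the underlying integer codes
```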
A list is like a vector where each entry in the vector can be its own data class. It is the most general data class in R.
We also have matrices, the two-dimensional extension of vectors, and arrays, matrices with more than two dimensions.
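A short illustration of these classes, with made-up contents:

```r
# A list can hold entries of different classes and lengths.
customer <- list(
  id = 1001,
  name = "Ana",
  purchases = c(2, 0, 5)
)
str(customer)

# A matrix is a two-dimensional vector (all one data type).
m <- matrix(1:6, nrow = 2)
dim(m)

# An array extends this to more than two dimensions.
a <- array(1:24, dim = c(2, 3, 4))
dim(a)
```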
Indexing (subsetting, selecting) is about picking part of an object.
# A tibble: 10,531 × 180
customer_id birth_year gender credit married college_degree region state
<dbl> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 1001 1971 Female 742. No No South DC
2 1002 1970 Female 749. Yes No West WA
3 1003 1988 Male 542. No No South AR
4 1004 1984 Other 574. Yes Yes Midwest MN
5 1005 1987 Male 644. No Yes West HI
6 1006 1994 Male 554. Yes Yes Midwest MN
7 1007 1968 Male 608. No No Midwest MN
8 1008 1994 Male 710. No No South KY
9 1009 1958 Male 702. No No West NM
10 1010 1994 Female 605. Yes No Northeast VT
# ℹ 10,521 more rows
# ℹ 172 more variables: star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>, jan_2005 <dbl>, feb_2005 <dbl>,
# mar_2005 <dbl>, apr_2005 <dbl>, may_2005 <dbl>, jun_2005 <dbl>,
# jul_2005 <dbl>, aug_2005 <dbl>, sep_2005 <dbl>, oct_2005 <dbl>,
# nov_2005 <dbl>, dec_2005 <dbl>, jan_2006 <dbl>, feb_2006 <dbl>,
# mar_2006 <dbl>, apr_2006 <dbl>, may_2006 <dbl>, jun_2006 <dbl>, …
# A tibble: 10,531 × 13
customer_id birth_year gender income credit married college_degree region
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1001 1971 Female 73000 742. No No South
2 1002 1970 Female 31000 749. Yes No West
3 1003 1988 Male 35000 542. No No South
4 1004 1984 Other 64000 574. Yes Yes Midwest
5 1005 1987 Male 58000 644. No Yes West
6 1006 1994 Male 164000 554. Yes Yes Midwest
7 1007 1968 Male 39000 608. No No Midwest
8 1008 1994 Male 69000 710. No No South
9 1009 1958 Male 233000 702. No No West
10 1010 1994 Female 77000 605. Yes No Northeast
# ℹ 10,521 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>
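The basic indexing operators can be sketched on small made-up objects (these are not the objects that produced the output above):

```r
# Vectors: index by position or by a logical condition.
x <- c(10, 20, 30)
x[2]
x[x > 15]

# Lists: [ ] returns a smaller list, [[ ]] and $ return the element itself.
lst <- list(a = 1, b = "two")
lst["b"]
lst[["b"]]
lst$b

# Data frames: index by [row, column] or pull a column with $.
df <- data.frame(id = 1:3, income = c(50, 60, 70))
df[2, "income"]
df$income
```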
The summary() function is a generic function that summarizes an object in a way that is (usually) appropriate for the type of data.
Min. 1st Qu. Median Mean 3rd Qu. Max.
18000 83000 139000 138623 187000 470000
Length Class Mode
[1,] 13 spec_tbl_df list
[2,] 169 spec_tbl_df list
Keep tabs on the dimensions of an object using length() and (more generally) str().
R treats missing values in a particular way: NA.
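A minimal sketch of how NA propagates in base R, using made-up incomes:

```r
incomes <- c(50000, NA, 70000)

# Most summaries return NA if any value is missing.
mean_with_na <- mean(incomes)

# Drop missing values explicitly with na.rm = TRUE.
mean_without_na <- mean(incomes, na.rm = TRUE)

# Find and remove missing values by indexing.
is.na(incomes)
complete_incomes <- incomes[!is.na(incomes)]
```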
Or the tidyverse way:
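Two common tidyverse options, sketched on a hypothetical data frame; which one the slides used is an assumption here:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing income.
toy_customers <- tibble(
  customer_id = c(1001, 1002, 1003),
  income = c(50000, NA, 70000)
)

# Drop rows with a missing income using tidyr.
dropped <- toy_customers |>
  drop_na(income)

# Or filter them out with dplyr.
filtered <- toy_customers |>
  filter(!is.na(income))
```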
The plot() function is another generic function that handles an object in a way that is (usually) appropriate for the type of data.
There are other flavors of plot().
Base R plotting can be deceptively straightforward. Using {ggplot2} can feel very verbose in comparison, but it’s also more explicit. The actual plot that’s created is stored very differently. While {ggplot2} builds a plot object out of layers, base R draws directly onto the graphics device, so you can call plot()-adjacent functions to draw on top of the previous plot.
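A minimal sketch of base R layering on made-up data, where each call after plot() draws onto the same device:

```r
x <- 1:10
y <- x^2

plot(x, y)                # start a new plot
lines(x, y)               # draw a line on top of the points
abline(h = 50, lty = 2)   # add a dashed horizontal reference line

hist(y)                   # another flavor: a histogram starts a new plot
```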
If you have to copy and paste something more than twice you should consider coding the iteration. This is usually in the form of a for loop (though you don’t have to write it, see apply() and map()).
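A minimal sketch of the same iteration written three ways, using a made-up list of incomes (in thousands):

```r
library(purrr)

# Hypothetical incomes by region.
incomes <- list(
  west = c(50, 70),
  south = c(40, 80, 90)
)

# A for loop: pre-allocate the output, then fill it in.
avg_loop <- numeric(length(incomes))
for (i in seq_along(incomes)) {
  avg_loop[i] <- mean(incomes[[i]])
}

# The same iteration without writing the loop.
avg_sapply <- sapply(incomes, mean)   # base R apply family
avg_map <- map_dbl(incomes, mean)     # {purrr}
```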
Sometimes you want code to run only when certain conditions are met. To do this, use conditional statements. Note that these are a separate idea from for loops, but a for loop can be a great place to use them!
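A minimal sketch of conditional statements inside a for loop. The credit scores and the cutoffs of 600 and 700 are made up for illustration.

```r
credit_scores <- c(580, 670, 740)
credit_labels <- character(length(credit_scores))

for (i in seq_along(credit_scores)) {
  if (credit_scores[i] >= 700) {
    credit_labels[i] <- "good"
  } else if (credit_scores[i] >= 600) {
    credit_labels[i] <- "fair"
  } else {
    credit_labels[i] <- "poor"
  }
}

credit_labels
```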
While a for loop is a powerful way to code an iteration and conditional statements give us even more flexibility, we might still need to copy and paste more than twice. What we often want is a function.
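A minimal sketch of a function; the name `summarize_scores` and its arguments are hypothetical:

```r
# Compute the mean and standard deviation of a numeric vector.
summarize_scores <- function(x, na.rm = TRUE) {
  out <- c(
    mean = mean(x, na.rm = na.rm),
    sd = sd(x, na.rm = na.rm)
  )
  return(out)
}

score_summary <- summarize_scores(c(600, 700, NA))
score_summary

# To step through the function interactively on its next call:
# debugonce(summarize_scores)
# summarize_scores(c(600, 700, NA))
```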
A few things to note:
- Use return() to get back a specific output.
- To debug, run debugonce() on the function and then call the function.
Prepare a five-to-ten minute presentation on the paper you’ve selected and read. Note that if you want to include links the notation is [link name](URL) and an image is ![caption](path). You can also use $s around math or $$s around blocks of math.
Finally, if you modify the header you can specify where any figures you create will be saved (this will be especially important in future assignments).
---
title: "Paper Title"
format: revealjs
knitr:
  opts_chunk:
    fig.path: "../Figures/"
---
The name of each figure will match the label of the code block in which it was created (i.e., #| label: fig-ppd).